Read First

High Level Flow

1) Understand the domain; ask the domain experts questions

2) Acquiring data

3) First glance at the data (can sometimes be skipped)

4) Sample if needed

5) Given a dataframe that fits in memory, we can investigate further

6) Consider missing values and outliers

7) Now we can do some heavy lifting exploration





Detailed Flow

2) Acquiring data

You can acquire the data from the following resources:

If you feel more comfortable with another file format, you can use the following
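For example, a minimal pandas sketch for converting a CSV into other common formats (all file names here are hypothetical):

```python
import pandas as pd

df = pd.read_csv("data.csv")               # hypothetical source file
df.to_parquet("data.parquet")              # requires pyarrow or fastparquet
df.to_json("data.json", orient="records")  # one JSON object per row
```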

3) First glance at the data (can sometimes be skipped)

Once you know the business process and you have the data on your end, you should play around with your data.

For each file format there are different tools.
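For example, the usual first-glance calls in pandas (assuming a CSV; the file name is hypothetical):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file

df.info()             # column dtypes and non-null counts (prints directly)
print(df.head())      # first few rows
print(df.describe())  # summary statistics for numeric columns
print(df.shape)       # (rows, columns)
```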

4) Sample if needed: if the data doesn't fit in RAM

Most of the fast, general-purpose tools work in RAM, so we should pick a sample of the data that represents it well.

We use the following:
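One simple approach, sketched here with pandas chunking (file name, chunk size, and sample fraction are all hypothetical, and this is only one of several possible tools):

```python
import pandas as pd

# Stream the file in chunks and keep a small random sample of each chunk,
# so the full dataset never has to fit in RAM at once.
chunks = pd.read_csv("big_data.csv", chunksize=100_000)  # hypothetical file
sample = pd.concat(chunk.sample(frac=0.01, random_state=42) for chunk in chunks)
print(sample.shape)
```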

5) Given a dataframe that fits in memory, we can investigate further

By far the most useful tool I know of is pandas-profiling (https://github.com/JosPolfliet/pandas-profiling). It allows us to do the following (a usage sketch appears after this list):

  • See the dataset's metadata: number of records, number of bytes
  • Get warnings about the data, such as high correlations, missing values, etc.
  • Identify the schema of the data, i.e. the data type and category of each variable
  • Look at means, medians, standard deviations and histograms to understand the distributions
  • Check completeness: are critical data values missing? A database with missing values is not unusual, but when the missing information is critical, completeness becomes an issue.
  • Check conformity: does the data follow standard definitions? For example, are dates in a standard format? Conformity to standard formats is important for keeping structure and nomenclature consistent, both for sharing and for internal data management. Are your data values correct?
  • Note: box plots for continuous variables are still missing
  • Note: consider practical significance; a small effect can sometimes be useful and a large one useless
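A minimal usage sketch, assuming a CSV named data.csv (hypothetical); note that the exact call signatures differ between pandas-profiling versions:

```python
import pandas as pd
import pandas_profiling  # pip install pandas-profiling

df = pd.read_csv("data.csv")  # hypothetical file
profile = pandas_profiling.ProfileReport(df)
profile.to_file(outputfile="report.html")  # newer versions: to_file("report.html")
```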

Another cool tool allows one to create pivot tables on a dataframe: https://github.com/nicolaskruchten/jupyter_pivottablejs. It allows us to (see the sketch after this list):

  • Slice our data
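A quick sketch of how it is typically invoked inside a Jupyter notebook (the file name is hypothetical):

```python
import pandas as pd
from pivottablejs import pivot_ui  # pip install pivottablejs

df = pd.read_csv("data.csv")  # hypothetical file
pivot_ui(df)  # renders an interactive drag-and-drop pivot table in the notebook
```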

Another cool tool allows one to easily do bi-variate analysis on a dataframe: https://github.com/ayush1997/visualize_ML. It allows us to (a hand-rolled sketch follows the list):

  • Perform bi-variate analysis, which finds the relationship between two variables. Here we look for association and disassociation between variables at a pre-defined significance level.
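If you prefer to hand-roll the same idea rather than use the library's own API, a minimal bi-variate check with pandas and scipy might look like this (the column names are hypothetical):

```python
import pandas as pd
from scipy import stats

df = pd.read_csv("data.csv")  # hypothetical file

# Association between two (hypothetical) numeric columns at alpha = 0.05
r, p = stats.pearsonr(df["age"], df["income"])
print(f"Pearson r = {r:.3f}, p-value = {p:.4f}")
if p < 0.05:
    print("Significant association at the 0.05 level")
```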

Another cool tool allows visualization on a dataframe: https://github.com/altair-viz/altair_widgets. It allows us to (see the sketch after this list):

  • Perform the same kind of bi-variate analysis interactively
  • Run small, quick analyses
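A sketch of how it is used; treat the interact_with entry point as an assumption and check the repo's README for the current API:

```python
import pandas as pd
from altair_widgets import interact_with  # pip install altair_widgets

df = pd.read_csv("data.csv")  # hypothetical file
interact_with(df)  # interactive widget for choosing fields and chart types
```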

The following issues need to be taken into consideration in this part as well (a pandas sketch follows the list):

  • Check timeliness: is the data available when expected and needed? Timeliness depends on the user's expectations and needs. This is relevant mainly for sources where the data is acquired frequently.
  • Check consistency: does the data across several systems reflect the same information? If data is reported across multiple systems, it should carry the same information.
  • Check integrity: is the data valid across its relationships, and can all the data in a database be traced and connected? For example, in a customer database there should be a valid customer/sales relationship.
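A small sketch of how completeness and conformity checks might look in plain pandas (the column name and date format are hypothetical):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file

# Completeness: fraction of missing values per column
print(df.isnull().mean().sort_values(ascending=False))

# Conformity: count rows whose date column fails the expected format
parsed = pd.to_datetime(df["order_date"], format="%Y-%m-%d", errors="coerce")
print(f"{parsed.isnull().sum()} rows with non-conforming dates")
```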

6) Investigate outliers and missing values

Sometimes the missing values hold a pattern in them; we should use the following:
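One common option, sketched here with the missingno library (my suggestion, not necessarily the tool originally intended; the file name is hypothetical):

```python
import pandas as pd
import missingno as msno  # pip install missingno

df = pd.read_csv("data.csv")  # hypothetical file
msno.matrix(df)   # nullity matrix: where the gaps are, row by row
msno.heatmap(df)  # nullity correlations: which columns go missing together
```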


Sometimes the outliers are actually the most interesting part of the data, which makes this a very important step. One can find outliers (univariate and multivariate) using the following:
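For example, a minimal univariate z-score filter plus a multivariate option via scikit-learn's IsolationForest (thresholds and column names are hypothetical):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("data.csv")  # hypothetical file

# Univariate: flag values more than 3 standard deviations from the mean
z = (df["income"] - df["income"].mean()) / df["income"].std()
univariate_outliers = df[z.abs() > 3]

# Multivariate: IsolationForest marks outliers with the label -1
labels = IsolationForest(random_state=42).fit_predict(df[["age", "income"]])
multivariate_outliers = df[labels == -1]
print(len(univariate_outliers), len(multivariate_outliers))
```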

After the outliers have been found, one should explore the data as in step 5.

7) Now we can do some heavy lifting exploration:

Now you are supposed to understand the data and be able to actually answer the domain questions from step 1.





Stuff to check in the future: